load(here("Data/sim_H0_ksize.RData")) # filter size
load(here("Data/wt_exp_mat.RData"))
load(here("Data/sim_H0_expwt.RData")) # exponential weight
load(here("Data/wt_type.RData"))
load(here("Data/sim_H0_wt_type.RData")) # shape of weight
# df_wt_type$wt_type <- factor(df_wt_type$wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))
# load(here("Data/sim_H0_stride.RData"))
N <- 100 # 100 subjects for block bootstrap
M <- 1000 # bootstrap replicates
N_vec <- c(100, 200, 500, 1000) # sample size used to test simple linear regression
# block size in block bootstrap
max_size <- 9
# image size
nS <- 32
# varying parameters used for data generation
ksize_vec <- unique(df_ksize$ksize)
alpha_vec <- unique(df_expwt$alpha)
wt_type_vec <- unique(df_wt_type$wt_type)
Technically, the simulation is set up on a continuous space-time domain, even though the final “observations” are a discrete realization.
Take a Gaussian process for example:
1. The observation is the sum of a true latent process and noise:
\[Y_i(\mathbf{s}, t) = \eta_i(\mathbf{s}, t) +\epsilon_i(\mathbf{s}, t)\]
2. The true latent process is composed of a fixed process and a random (subject-specific) process:
\[\eta_i(\mathbf{s}, t) = \mu(\mathbf{s}, t)+b_i(\mathbf{s}, t)\]
3. The error process is a moving average of a white-noise field over a filter region:
\[\epsilon_i(\mathbf{s}, t) = \frac{1}{N_r}\sum_{\mathbf{s'} \in S_r}Z(\mathbf{s'}, t)\]
where \(S_r\) is the filter region around \(\mathbf{s}\), \(N_r\) is the number of pixels in \(S_r\), and \(Z(\cdot, t)\) is a white-noise field.
We then generate 2D/3D matrices that realize this process on a discrete grid of pixels.
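As a minimal sketch (the helper name `ma_field` is hypothetical, not the code used for the saved data), a 2D error field of this kind can be generated by averaging a padded white-noise matrix over a \(k \times k\) neighborhood:

```r
# Generate one 2D moving-average error field (hypothetical helper):
# each pixel is the mean of a ksize x ksize neighborhood of white noise,
# mimicking eps_i(s, t) above with a square filter region S_r.
ma_field <- function(nS = 32, ksize = 3) {
  pad <- (ksize - 1) / 2
  z <- matrix(rnorm((nS + 2 * pad)^2), nS + 2 * pad, nS + 2 * pad)
  out <- matrix(0, nS, nS)
  for (i in 1:nS) {
    for (j in 1:nS) {
      out[i, j] <- mean(z[i:(i + ksize - 1), j:(j + ksize - 1)])
    }
  }
  out
}
set.seed(1)
eps <- ma_field(32, 5)
dim(eps)  # 32 x 32
```

Padding the noise field keeps the output the same size as the image, so border pixels are averaged over a full neighborhood rather than a truncated one.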
A moving-average filter with a different area size, or with different weights, can lead to different variation in the generated image. I hereafter call the variance of a generated pixel the “baseline variance”.
First, if we look at the baseline variance of a single pixel generated by the moving weighted average (MWA), we get:
\[Var(X)=Var\left(\frac{\sum_{i,j}w_{ij}{z_{ij}}}{n}\right)=\frac{\sum_{i,j}w_{ij}^2Var(z_{ij})}{n^2} = \frac{\sum_{i,j}w_{ij}^2}{n^2}, \] assuming independent \(z_{ij}\) with \(Var(z_{ij}) = 1\).
This clearly changes with the weights. To make a fair comparison, let’s scale the weights so that the generated data has the same baseline variance.
Let’s try a new weight \(w_{ij}'=w_{ij}\frac{n}{\sqrt{\sum w_{ij}^2}}\), where \(n\) is the number of pixels in the filter; then \(\sum_{i,j} (w_{ij}')^2/n^2 = 1\).
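This rescaling can be checked numerically; a quick sketch with arbitrary unscaled weights:

```r
# Verify that rescaled weights give unit baseline variance:
# Var(sum(w' z) / n) = sum(w'^2) / n^2 = 1 when w' = w * n / sqrt(sum(w^2)).
set.seed(42)
ksize <- 5
n <- ksize^2
w <- exp(-0.5 * seq_len(n))            # arbitrary unscaled weights
w_scaled <- w * n / sqrt(sum(w^2))     # proposed rescaling
sum(w_scaled^2) / n^2                  # 1 (up to floating point)
# Monte Carlo confirmation with unit-variance white noise
x <- colSums(w_scaled * matrix(rnorm(n * 1e5), nrow = n)) / n
var(x)                                 # approx 1
```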
In this section, we would like to see whether the filter size of the moving average affects the spatial correlation. Here we generate 2D images of 32 by 32 pixels. Note that the filter size should be an odd number, a convention inherited from image analysis: though an even-sized filter is technically possible, an odd size is much more convenient for many operations since it has a center pixel.
df_ksize %>%
filter(id==1) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill=Y1))+
facet_wrap(~ksize, nrow = 1)
Below I plot the simple variance and the variogram at different kernel sizes. The variogram measures spatial dependence as a function of distance: at a given distance, a smaller value indicates greater dependence.
df_ksize %>%
mutate(ksize = as.factor(ksize)) %>%
group_by(id, ksize) %>%
summarise(var = var(Y1), .groups = "keep") %>%
ggplot(aes(x=ksize, y=var, group = ksize))+
geom_boxplot()+
geom_jitter(size = 0.2)+
geom_hline(yintercept = 1, col = "red")+
labs(title = "Individual variance")
# df_ksize %>%
# mutate(ksize = as.factor(ksize)) %>%
# group_by(ksize) %>%
# summarise(var = var(Y1), .groups = "keep")
It’s interesting to see that the individual variance actually dropped with larger filter size (greater spatial dependence), even after scaling. If we calculate the variance over all subjects, the values are all very close to 1; the spatial dependence probably cancels out across many subjects.
df_ksize %>%
filter(id <= 5) %>%
mutate(ksize = as.factor(ksize), subj_id = id) %>%
group_by(subj_id, ksize) %>%
group_modify(~{variogram(Y1~1, locations = ~s1+s2, data = .x)}) %>%
ggplot()+
geom_line(aes(x=dist, y=gamma, col=ksize))+
geom_point(aes(x=dist, y=gamma, col=ksize))+
geom_hline(yintercept = 1, linetype = "dashed")+
facet_wrap(~subj_id, nrow = 1)+
labs(y="Sample variogram")
Some observations:
Here, I would like to use a weighted average within the filter. Let’s fix the kernel size at 5 and explore the effect of different weight matrices.
I would like to construct a weight matrix that decays quickly from the center to the border. Let’s try the exponential of the negative squared Euclidean distance to the center. The weights are scaled so that they sum to one within the filter, which keeps the overall variation of the generated outcome comparable to the other cases.
\[w_{ij} \propto exp(-\alpha d_{ij}^2)\]
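A sketch of how such a weight matrix could be built for a 5 by 5 filter (the function name `exp_weight` is hypothetical; normalizing to sum one reproduces the center and corner weights quoted below):

```r
# Exponential-decay weights on a ksize x ksize filter, normalized
# to sum to one: w_ij proportional to exp(-alpha * d_ij^2).
exp_weight <- function(ksize = 5, alpha = 5) {
  ctr <- (ksize + 1) / 2
  d2 <- outer(1:ksize, 1:ksize,
              function(i, j) (i - ctr)^2 + (j - ctr)^2)  # squared distance
  w <- exp(-alpha * d2)
  w / sum(w)
}
w5 <- exp_weight(5, 5)
w5[3, 3]  # center weight, about 0.974
w5[1, 1]  # corner weight, about 4.1e-18
```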
df_wt <- expand.grid(s2=1:5, s1=1:5)
for(m in seq_along(alpha_vec)){
df_wt[, paste0("alpha", alpha_vec[m])] <- as.vector(wt_exp[,,m])
}
df_wt %>%
pivot_longer(starts_with("alpha")) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill=value))+
facet_wrap(~name, nrow = 1)+
labs(title = "Weight matrix")
Here, when \(\alpha=5\), the border weights get very, very close to zero (corner pixel: 4.14e-18), and the center pixel takes 97.4% of the weight. When \(\alpha=0\), the weight matrix reduces to a simple average. As \(\alpha\) increases, the filter puts more weight on the center pixel and less on the border pixels, leading to lower spatial dependence.
df_expwt %>%
filter(id==1) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill=Y1))+
facet_wrap(~alpha, nrow = 1)+
labs(title = "Generated data")
df_expwt %>%
mutate(alpha = as.factor(alpha)) %>%
group_by(id, alpha) %>%
summarise(var = var(Y1), .groups = "keep") %>%
ggplot(aes(x=alpha, y=var, group = alpha))+
geom_boxplot()+
geom_jitter(size = 0.2)+
geom_hline(yintercept = 1, col = "red")+
labs(title = "Individual variance")
# df_expwt %>%
# filter(id == 1) %>%
# mutate(alpha = as.factor(alpha)) %>%
# group_by(alpha) %>%
# summarise(var = var(Y1), .groups = "keep")
df_expwt %>%
filter(id %in% 1:5) %>%
mutate(alpha=as.factor(alpha), subj_id = id) %>%
group_by(alpha, subj_id) %>%
group_modify(~{variogram(Y1~1, locations = ~s1+s2, data = .x)}) %>%
ggplot()+
geom_line(aes(x=dist, y=gamma, col=alpha))+
geom_point(aes(x=dist, y=gamma, col=alpha))+
geom_hline(yintercept = 1, linetype = "dashed")+
facet_wrap(~subj_id, nrow = 1)+
labs(y="Sample variogram for five subjects")
How come the semivariogram can go over 1 when \(\alpha=1\)? Did this weight introduce greater variation (noise) into the data? It does not look that way in the individual variance plot. The other curves flattened out at 1.
The variograms from Amy took on many wild shapes! The semivariance structure seems to vary greatly across lesions, though the “peak” shape is fairly common. I am not sure how I would generate all of these different spatial correlations. Intuitively, I would start with different types of weights, with different functional shapes with respect to Euclidean distance.
Let’s fix the filter size at 5 by 5 and create several different types of filters.
Note that the trigonometric weights take negative values.
Here I cannot scale the weights as before: for the trigonometric functions, the sum of the weights can be very close to zero, which inflates the magnitude of the scaled weights dramatically, so the generated data would certainly not have comparable variation. For example, the sum of the cosine weights would be -0.47, and the scaled weights would range over (-53.12, 53.12).
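The actual weight matrices are loaded from `wt_type.RData`; an illustrative (purely hypothetical) version of the five shapes as functions of distance to the filter center might look like:

```r
# Illustrative versions of the five weight shapes as functions of
# distance to the center of a 5 x 5 filter. The real constructions
# in wt_type.RData may differ; these only mimic the shapes.
ksize <- 5
ctr <- (ksize + 1) / 2
d <- sqrt(outer(1:ksize, 1:ksize,
                function(i, j) (i - ctr)^2 + (j - ctr)^2))
wt_demo <- list(
  wt_inc = d / max(d),          # increasing with distance
  wt_dec = 1 - d / max(d),      # decreasing with distance
  wt_u   = (d - mean(d))^2,     # u-shaped
  wt_sin = sin(2 * d),          # sine, scaled so negatives appear
  wt_cos = cos(d)               # cosine, takes negative values
)
sapply(wt_demo, range)
```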
wt_type %>%
pivot_longer(starts_with("wt_")) %>%
mutate(name=factor(name, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot(aes(x=s1, y=s2, fill = value))+
geom_tile()+
facet_wrap(~name, nrow = 1)+
labs(title = "Weight matrix")
wt_type %>%
pivot_longer(starts_with("wt_")) %>%
mutate(name=factor(name, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot(aes(x=dist, y= value))+
geom_line()+
geom_point()+
facet_wrap(~name, nrow = 1)+
labs(title = "Weight as a function of distance")
df_wt_type %>%
filter(id==1) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot(aes(x=s1, y=s2, fill=Y1))+
geom_tile()+
facet_wrap(~wt_type, nrow = 1)+
labs(title = "Example of generated data")
Simple variance of generated data:
df_wt_type %>%
group_by(wt_type, id) %>%
summarise(var_Y1=var(Y1), .groups = "keep") %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot(aes(x=wt_type, y = var_Y1, group=wt_type))+
geom_boxplot()+
geom_hline(yintercept = 1, col = "red")+
geom_jitter(size = 0.2)
Based on the previous sections, I think the dec, inc, and u-shaped weights induced stronger spatial correlation.
df_wt_type %>%
filter(id<=5) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos")),
subj_id = id) %>%
group_by(subj_id, wt_type) %>%
group_modify(~{variogram(Y1 ~ 1, locations = ~s1+s2, data = .x)}) %>%
ggplot(aes(x=dist, y=gamma, col=wt_type))+
geom_line()+
geom_point()+
facet_wrap(~subj_id, nrow = 1)+
labs(y="Semivariance")
Some of my interpretation:
As distance increases, the spatial correlation eventually fades, so the curves flatten out at about the baseline variance level. The major differences lie in the short-distance range.
Following up on last time, we would like to generate data under the null hypothesis, where \(Y_1\) and \(Y_2\) are not correlated with each other. We also remove any fixed/random effects, leaving only the moving-average error, in case they induce any unexpected correlation.
We generate data from the following scheme:
\[\begin{aligned} Y_{1i}(s_1, s_2, t) &=\epsilon_{1i}(s_1, s_2, t) \\ Y_{2i}(s_1, s_2, t) &=\epsilon_{2i}(s_1, s_2, t) \\ \end{aligned}\]
df_ksize %>%
filter(id==15) %>%
pivot_longer(starts_with("Y")) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill = value))+
facet_grid(cols=vars(ksize), rows = vars(name))+
labs(title = "Generated data of subject ID = 15")
df_ksize %>%
filter(id %in% 1:4) %>%
ggplot(aes(x=Y1, y=Y2))+
geom_point(size = 0.2)+
geom_smooth(formula = 'y~x', method = "lm")+
stat_cor(method = "pearson")+
facet_grid(rows = vars(id), cols = vars(ksize))+
labs(title = "Pearson correlation of four subjects")
When a larger filter induced greater spatial correlation, significant correlations were (falsely) detected more often.
We fit a simple linear model across space conditioning on each subject and time:
\[Y_{2i}(\mathbf{s}|t) = \beta_{it0}+\beta_{it1}Y_{1i}(\mathbf{s}|t)+\epsilon_i(\mathbf{s}|t)\]
lm_err <- data.frame(ksize = unique(df_ksize$ksize))
for(n in seq_along(N_vec)){
err_n <- df_ksize %>%
filter(id <= N_vec[n]) %>%
group_by(id, ksize) %>%
group_modify(~{
fit_lm <- lm(Y2~Y1, data = .x)
data.frame(beta = coef(fit_lm)["Y1"],
pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
}) %>%
mutate(reject = pval < 0.05) %>%
group_by(ksize) %>%
summarise_at("reject", mean)
lm_err[, paste0("N", N_vec[n])] <- err_n$reject
}
lm_err %>%
rename("Filter size" = "ksize") %>%
kable(table.attr = "style=\"color:black;\"") %>%
kable_styling(full_width = F) %>%
add_header_above(c(" " = 1, "Type I error" = 4))
| Filter size | N100 | N200 | N500 | N1000 |
|---|---|---|---|---|
| 1 | 0.07 | 0.060 | 0.058 | 0.056 |
| 3 | 0.42 | 0.365 | 0.336 | 0.331 |
| 5 | 0.53 | 0.545 | 0.554 | 0.564 |
| 7 | 0.64 | 0.670 | 0.692 | 0.674 |
| 9 | 0.73 | 0.755 | 0.756 | 0.743 |
Even after increasing the sample size to 1000, the k = 1 case had a 5.6% type I error, slightly above the nominal 5%. Does this indicate that data generated under this scheme is somewhat prone to false rejection under simple linear regression? Could the spatial correlation have introduced some correlation between \(Y_1\) and \(Y_2\)?
We hope to bootstrap non-overlapping blocks of pixels to preserve some degree of spatial correlation. While the image may not divide evenly by the block size, we keep the expected size of the resampled image constant by sampling each block with equal probability. For each resampled image, regardless of its size, we fit a simple linear regression model per subject and examine the significance of the slope.
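A single resampling step under this scheme might look like the following sketch (the block assignment and the function name `block_boot_once` are illustrative, not the code that produced the saved results):

```r
# One block-bootstrap replicate for a single subject's image:
# partition pixels into non-overlapping bsize x bsize blocks (edge
# blocks may be smaller), resample blocks with equal probability,
# and refit the pixel-level regression on the resampled image.
block_boot_once <- function(df, bsize = 4) {
  df$block <- paste(ceiling(df$s1 / bsize), ceiling(df$s2 / bsize))
  blocks <- unique(df$block)
  # draw as many blocks as the original image has, with replacement
  picked <- sample(blocks, length(blocks), replace = TRUE)
  boot_df <- do.call(rbind, lapply(picked, function(b) df[df$block == b, ]))
  coef(lm(Y2 ~ Y1, data = boot_df))["Y1"]
}
```

Repeating this M times per subject yields the bootstrap distribution of the slope, whose 2.5% and 97.5% quantiles form the interval examined below.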
P.S. Originally, we wanted to set the resampling probability proportional to block size. But that approach does not keep the expected number of pixels the same as in the original image; sampling all blocks with equal probability does.
Let’s say the original image has \(N\) pixels and \(B\) blocks, with block \(b\) containing \(n_b\) pixels, and we resample a new image that also has \(B\) blocks. If \(p_b \propto n_b\), i.e. \(p_b = n_b/N\), then
\[E(N_{new}) = E\left(\sum_{b=1}^B n_b^{new}\right) = \sum_{b=1}^B E(n_b^{new}) = \sum_{b=1}^B\sum_{b'=1}^B n_{b'}p_{b'} = \sum_{b=1}^B\sum_{b'=1}^B n_{b'}\frac{n_{b'}}{N} = B\frac{\sum_{b'=1}^B n_{b'}^2}{N}, \] which is not \(N\) in general. If instead we set \(p_b = 1/B\), then
\[E(N_{new}) = E\left(\sum_{b=1}^B n_b^{new}\right) = \sum_{b=1}^B E(n_b^{new}) = \sum_{b=1}^B\sum_{b'=1}^B n_{b'}\frac{1}{B} = \sum_{b=1}^B\frac{N}{B} = N. \]
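The equal-probability result can be checked by simulation; a small sketch with an unevenly blocked image (the block sizes are illustrative):

```r
# A 10 x 10 image cut into 3 x 3 blocks leaves ragged edge blocks.
set.seed(7)
side <- c(3, 3, 3, 1)                 # block side lengths along one axis
n_b  <- as.vector(outer(side, side))  # pixel counts of the B = 16 blocks
N    <- sum(n_b)                      # 100 pixels in the original image
B    <- length(n_b)
# resample B blocks with equal probability and count pixels
new_sizes <- replicate(1e5, sum(sample(n_b, B, replace = TRUE)))
mean(new_sizes)  # approx 100 = N
```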
load(here("Data/block_boot_H0_ksize.RData"))
ksize_vec <- unique(df_ksize$ksize)
# calculate Type I error
lqt_mat <- apply(slope_est_ksize[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat <- apply(slope_est_ksize[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
reject_mat <- 0 < lqt_mat | uqt_mat < 0
typeIerror <- apply(reject_mat, 1:2, mean)
typeIerror <- data.frame(ksize = ksize_vec, typeIerror)
typeIerror %>% pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
ggplot()+
geom_line(aes(x=bsize, y=value, group=as.factor(ksize), col=as.factor(ksize)))+
geom_point(aes(x=bsize, y=value, group=as.factor(ksize), col=as.factor(ksize)))+
geom_hline(yintercept = 0.05, col = "black", linetype = "dashed")+
labs(x="Block size", y="Type I error", col = "Filter size")+
scale_x_continuous(breaks = 1:max_size)
After scaling, the type I error at the largest block size differs less across filter sizes than before, but the trend stays the same: larger filter sizes led to higher type I error.
Below I show the mean and standard deviation of bootstrap slope estimates for all 100 subjects (each point represents a subject).
# mean slope estimates
slope_mean1 <- apply(slope_est_ksize, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean1, perm = c(1, 3, 2)),
dim =c(N*length(ksize_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(ksize_vec)),
ksize = rep(ksize_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~ksize, ncol = 1)+
labs(x="Block size", title="Bootstrap mean of slope estimates")+
scale_x_continuous(breaks = 1:max_size)+
geom_hline(yintercept = 0, col="red", linetype = "dashed")
# standard deviation of slope estimates
slope_sd1 <- apply(slope_est_ksize, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd1, perm = c(1, 3, 2)),
dim =c(N*length(ksize_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(ksize_vec)),
ksize = rep(ksize_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~ksize, ncol = 1)+
labs(x="Block size", title="Standard deviation of bootstrap estimates")+
scale_x_continuous(breaks = 1:max_size)
Below I also show an example of the bootstrap slope estimates for one subject.
array(slope_est_ksize[,,15,], dim = c(length(ksize_vec)*max_size, M)) %>%
data.frame() %>%
mutate(ksize = rep(ksize_vec, max_size),
bsize = rep(1:max_size, each = length(ksize_vec)),
.before = 1) %>%
pivot_longer(starts_with("X")) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group=bsize))+
geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
facet_wrap(~ksize, ncol=1)+
labs(x="Block size", title = "Bootstrap slope estimates of subject 15")+
geom_hline(yintercept = 0, col="red", linetype = "dashed")
Here the two parameters of interest are the weight matrix (in data generation process) and the block size (in bootstrap process). I will use the same weight matrix as Section 1.2, and the same block size as Section 2.1.
Just like before, I generate data for 1000 subjects but use only the first 100 for the block bootstrap portion. The filter size for generating the data is fixed at 5.
df_expwt %>%
filter(id==15) %>%
mutate(alpha=as.factor(alpha)) %>%
pivot_longer(starts_with("Y")) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill = value))+
facet_grid(cols=vars(alpha), rows = vars(name))+
labs(title = "Generated data of subject ID = 15")
df_expwt %>%
filter(id %in% 1:4) %>%
mutate(alpha=as.factor(alpha)) %>%
ggplot(aes(x=Y1, y=Y2))+
geom_point(size = 0.2)+
geom_smooth(formula = 'y~x', method = "lm")+
stat_cor(method = "pearson")+
facet_grid(rows = vars(id), cols = vars(alpha))+
labs(title = "Pearson correlation of four subjects")
lm_err2 <- data.frame(alpha = alpha_vec)
# N_vec <- c(100, 200, 500, 1000)
for(n in seq_along(N_vec)){
err_n <- df_expwt %>%
filter(id <= N_vec[n]) %>%
group_by(id, alpha) %>%
group_modify(~{
fit_lm <- lm(Y2~Y1, data = .x)
data.frame(beta = coef(fit_lm)["Y1"],
pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
}) %>%
mutate(reject = pval < 0.05) %>%
group_by(alpha) %>%
summarise_at("reject", mean)
lm_err2[, paste0("N", N_vec[n])] <- err_n$reject
}
lm_err2 %>%
# rename("Alpha" = "ksize") %>%
# arrange(desc(alpha)) %>%
mutate(alpha=fct_rev(as.factor(alpha))) %>%
kable(table.attr = "style=\"color:black;\"") %>%
kable_styling(full_width = F) %>%
add_header_above(c(" " = 1, "Type I error" = 4))
| alpha | N100 | N200 | N500 | N1000 |
|---|---|---|---|---|
| 0 | 0.52 | 0.545 | 0.526 | 0.531 |
| 0.5 | 0.35 | 0.355 | 0.418 | 0.438 |
| 1 | 0.34 | 0.310 | 0.272 | 0.256 |
| 2.5 | 0.13 | 0.120 | 0.106 | 0.083 |
| 5 | 0.10 | 0.075 | 0.066 | 0.057 |
load(here("Data/block_boot_H0_expwt.RData"))
# calculate Type I error
lqt_mat2 <- apply(slope_est_expwt[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat2 <- apply(slope_est_expwt[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
reject_mat2 <- 0 < lqt_mat2 | uqt_mat2 < 0
typeIerror2 <- apply(reject_mat2, 1:2, mean)
typeIerror2 <- data.frame(alpha = alpha_vec, typeIerror2)
typeIerror2 %>% pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(alpha=fct_rev(as.factor(alpha))) %>%
ggplot()+
geom_line(aes(x=bsize, y=value, group=as.factor(alpha), col=as.factor(alpha)))+
geom_point(aes(x=bsize, y=value, group=as.factor(alpha), col=as.factor(alpha)))+
geom_hline(yintercept = 0.05, col = "black", linetype = "dashed")+
labs(x="Block size", y="Type I error", col = "Weight matrix")+
scale_x_continuous(breaks = 1:max_size)
# mean slope estimates
slope_mean2 <- apply(slope_est_expwt, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean2, perm = c(1, 3, 2)),
dim =c(N*length(alpha_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(alpha_vec)),
alpha = rep(alpha_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(alpha=fct_rev(as.factor(alpha))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~alpha, ncol = 1)+
labs(x="Block size", title="Bootstrap mean of slope estimates")+
scale_x_continuous(breaks = 1:max_size)+
geom_hline(yintercept = 0, col="red", linetype = "dashed")
# standard deviation of slope estimates
slope_sd2 <- apply(slope_est_expwt, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd2, perm = c(1, 3, 2)),
dim =c(N*length(alpha_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(alpha_vec)),
alpha = rep(alpha_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(alpha=fct_rev(as.factor(alpha))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~alpha, ncol = 1)+
labs(x="Block size", title="Standard deviation of bootstrap estimates")+
scale_x_continuous(breaks = 1:max_size)
array(slope_est_expwt[,,15,], dim = c(length(alpha_vec)*max_size, M)) %>%
data.frame() %>%
mutate(alpha = rep(alpha_vec, max_size),
bsize = rep(1:max_size, each = length(alpha_vec)),
.before = 1) %>%
pivot_longer(starts_with("X")) %>%
mutate(alpha=fct_rev(as.factor(alpha))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group=bsize))+
geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
facet_wrap(~alpha, ncol=1)+
labs(x="Block size", title = "Bootstrap slope estimates of subject 15")+
geom_hline(yintercept = 0, col="red", linetype = "dashed")
In this section, the filter size is fixed at 5, and the data is generated with different functional shapes of the weight as a function of distance.
df_wt_type %>%
filter(id==15) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
pivot_longer(starts_with("Y")) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill = value))+
facet_grid(cols=vars(wt_type), rows = vars(name))+
labs(title = "Generated data of subject ID = 15")
df_wt_type %>%
filter(id %in% 1:4) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
ggplot(aes(x=Y1, y=Y2))+
geom_point(size = 0.2)+
geom_smooth(formula = 'y~x', method = "lm")+
stat_cor(method = "pearson")+
facet_grid(rows = vars(id), cols = vars(wt_type))+
labs(title = "Pearson correlation of four subjects")
lm_err3 <- data.frame(wt_type=unique(df_wt_type$wt_type))
# N_vec <- c(100, 200, 500, 1000)
for(n in seq_along(N_vec)){
err_n <- df_wt_type %>%
filter(id <= N_vec[n]) %>%
group_by(id, wt_type) %>%
group_modify(~{
fit_lm <- lm(Y2~Y1, data = .x)
data.frame(beta = coef(fit_lm)["Y1"],
pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
}) %>%
mutate(reject = pval < 0.05) %>%
group_by(wt_type) %>%
summarise_at("reject", mean)
lm_err3[, paste0("N", N_vec[n])] <- err_n$reject
}
lm_err3 %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
kable(table.attr = "style=\"color:black;\"") %>%
kable_styling(full_width = F) %>%
add_header_above(c(" " = 1, "Type I error" = 4))
| wt_type | N100 | N200 | N500 | N1000 |
|---|---|---|---|---|
| wt_inc | 0.10 | 0.145 | 0.162 | 0.165 |
| wt_dec | 0.45 | 0.455 | 0.418 | 0.404 |
| wt_u | 0.34 | 0.415 | 0.394 | 0.394 |
| wt_sin | 0.12 | 0.120 | 0.138 | 0.126 |
| wt_cos | 0.31 | 0.285 | 0.276 | 0.266 |
load(here("Data/block_boot_H0_wt_type.RData"))
# calculate Type I error
lqt_mat3 <- apply(slope_est_wt_type[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat3 <- apply(slope_est_wt_type[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
reject_mat3 <- 0 < lqt_mat3 | uqt_mat3 < 0
typeIerror3 <- apply(reject_mat3, 1:2, mean)
typeIerror3 <- data.frame(wt_type = wt_type_vec, typeIerror3)
typeIerror3 %>% pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
ggplot()+
geom_line(aes(x=bsize, y=value, group=wt_type, col=wt_type))+
geom_point(aes(x=bsize, y=value, group=wt_type, col=wt_type))+
geom_hline(yintercept = 0.05, col = "black", linetype = "dashed")+
labs(x="Block size", y="Type I error", col = "Weight matrix")+
scale_x_continuous(breaks = 1:max_size)
# mean slope estimates
slope_mean3 <- apply(slope_est_wt_type, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean3, perm = c(1, 3, 2)),
dim =c(N*length(wt_type_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(wt_type_vec)),
wt_type = rep(wt_type_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~wt_type, ncol = 1)+
labs(x="Block size", title="Bootstrap mean of slope estimates")+
scale_x_continuous(breaks = 1:max_size)+
geom_hline(yintercept = 0, col="red", linetype = "dashed")
# standard deviation of slope estimates
slope_sd3 <- apply(slope_est_wt_type, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd3, perm = c(1, 3, 2)),
dim =c(N*length(wt_type_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(wt_type_vec)),
wt_type = rep(wt_type_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~wt_type, ncol = 1)+
labs(x="Block size", title="Standard deviation of bootstrap estimates")+
scale_x_continuous(breaks = 1:max_size)
array(slope_est_wt_type[,,15,], dim = c(length(wt_type_vec)*max_size, M)) %>%
data.frame() %>%
mutate(wt_type = rep(wt_type_vec, max_size),
bsize = rep(1:max_size, each = length(wt_type_vec)),
.before = 1) %>%
pivot_longer(starts_with("X")) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group=bsize))+
geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
facet_wrap(~wt_type, ncol=1)+
labs(x="Block size", title = "Bootstrap slope estimates of subject 15")+
geom_hline(yintercept = 0, col="red", linetype = "dashed")
In this section, we wish to generate two spatially dependent variables that are also correlated with each other. Since the generation is again based only on moving averages of a white-noise field, we update the data generation scheme as follows:
\[\begin{aligned} Y_{1i}(s_1, s_2, t) &=\epsilon_{1i}(s_1, s_2, t) \\ Y_{2i}(s_1, s_2, t) &=\beta_0+\beta_1 Y_{1i}(s_1, s_2, t) + \epsilon_{2i}(s_1, s_2, t) \\ \end{aligned}\]
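A minimal sketch of generating one correlated pair under this scheme for a single subject, assuming \(\beta_0 = 0\) and \(\beta_1 = 1\) (the reference value used in the coverage checks below) and folding the moving-average step into a plain \(k \times k\) mean filter:

```r
# Generate one correlated pair (Y1, Y2) under H1 for a single subject.
# Each error field is a ksize x ksize moving average of independent
# white noise; beta0 = 0 and beta1 = 1 are assumed values.
gen_h1 <- function(nS = 32, ksize = 5, beta0 = 0, beta1 = 1) {
  ma <- function() {
    pad <- (ksize - 1) / 2
    z <- matrix(rnorm((nS + 2 * pad)^2), nS + 2 * pad)
    sapply(1:nS, function(j) sapply(1:nS, function(i)
      mean(z[i:(i + ksize - 1), j:(j + ksize - 1)])))
  }
  Y1 <- ma()
  Y2 <- beta0 + beta1 * Y1 + ma()  # linear link plus independent error
  list(Y1 = Y1, Y2 = Y2)
}
```

Because both fields share the moving-average structure, \(Y_2\) inherits spatial correlation from its own error and from \(Y_1\).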
load(here("Data/sim_H1_ksize.RData"))
# ksize_vec <- unique(df_ksize_h1$ksize)
load(here("Data/sim_H1_expwt.RData"))
load(here("Data/sim_H1_wt_type.RData"))
df_ksize_h1 %>%
filter(id==15) %>%
pivot_longer(starts_with("Y")) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill = value))+
facet_grid(cols=vars(ksize), rows = vars(name))+
labs(title = "Generated data of subject ID = 15")
df_ksize_h1 %>%
filter(id %in% 1:4) %>%
ggplot(aes(x=Y1, y=Y2))+
geom_point(size = 0.2)+
geom_smooth(formula = 'y~x', method = "lm")+
stat_cor(method = "pearson")+
facet_grid(rows = vars(id), cols = vars(ksize))+
labs(title = "Pearson correlation of four subjects")
Since the data is generated under the alternative hypothesis, we calculate power (the true rejection rate) in this case.
The LM is expected to reject in essentially every case, especially considering the spatial correlation.
lm_err1_h1 <- data.frame(ksize = ksize_vec)
for(n in seq_along(N_vec)){
err_n <- df_ksize_h1 %>%
filter(id <= N_vec[n]) %>%
group_by(id, ksize) %>%
group_modify(~{
fit_lm <- lm(Y2~Y1, data = .x)
data.frame(beta = coef(fit_lm)["Y1"],
pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
}) %>%
mutate(reject = pval < 0.05) %>%
group_by(ksize) %>%
summarise_at("reject", mean)
lm_err1_h1[, paste0("N", N_vec[n])] <- err_n$reject
}
lm_err1_h1 %>%
rename("Filter size" = "ksize") %>%
kable(table.attr = "style=\"color:black;\"") %>%
kable_styling(full_width = F) %>%
add_header_above(c(" " = 1, "Power" = 4))
| Filter size | N100 | N200 | N500 | N1000 |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 |
load(here("Data/block_boot_H1_ksize.RData"))
# calculate coverage rate
lqt_mat1_h1 <- apply(slope_est_ksize_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat1_h1 <- apply(slope_est_ksize_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
cover_mat1_h1 <- lqt_mat1_h1 < 1 & 1 < uqt_mat1_h1
cover_rate1_h1 <- apply(cover_mat1_h1, 1:2, mean)
cover_rate1_h1 <- data.frame(ksize = ksize_vec, cover_rate1_h1)
cover_rate1_h1 %>% pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
ggplot()+
geom_line(aes(x=bsize, y=value, group=as.factor(ksize), col=as.factor(ksize)))+
geom_point(aes(x=bsize, y=value, group=as.factor(ksize), col=as.factor(ksize)))+
geom_hline(yintercept = 0.95, col = "black", linetype = "dashed")+
labs(x="Block size", y="Coverage rate", col = "Filter size")+
scale_x_continuous(breaks = 1:max_size)
Below I show the mean and standard deviation of bootstrap slope estimates for all 100 subjects (each point represents a subject).
# mean slope estimates
slope_mean1_h1 <- apply(slope_est_ksize_h1, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean1_h1, perm = c(1, 3, 2)),
dim =c(N*length(ksize_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(ksize_vec)),
ksize = rep(ksize_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~ksize, ncol = 1)+
labs(x="Block size", title="Bootstrap mean of slope estimates")+
scale_x_continuous(breaks = 1:max_size)+
geom_hline(yintercept = 1, col="red", linetype = "dashed")
# standard deviation of slope estimates
slope_sd1_h1 <- apply(slope_est_ksize_h1, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd1_h1, perm = c(1, 3, 2)),
dim =c(N*length(ksize_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(ksize_vec)),
ksize = rep(ksize_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~ksize, ncol = 1)+
labs(x="Block size", title="Standard deviation of bootstrap estimates")+
scale_x_continuous(breaks = 1:max_size)
Below I also show an example of the bootstrap slope estimates for one subject.
array(slope_est_ksize_h1[,,15,], dim = c(length(ksize_vec)*max_size, M)) %>%
data.frame() %>%
mutate(ksize = rep(ksize_vec, max_size),
bsize = rep(1:max_size, each = length(ksize_vec)),
.before = 1) %>%
pivot_longer(starts_with("X")) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group=bsize))+
geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
facet_wrap(~ksize, ncol=1)+
labs(x="Block size", title = "Bootstrap slope estimates of subject 15")+
geom_hline(yintercept = 1, col="red", linetype = "dashed")
df_expwt_h1 %>%
filter(id==15) %>%
# mutate(alpha=fct_rev(as.factor(alpha))) %>%
pivot_longer(starts_with("Y")) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill = value))+
facet_grid(cols=vars(alpha), rows = vars(name))+
labs(title = "Generated data of subject ID = 15")
df_expwt_h1 %>%
# filter(id==1) %>%
filter(id %in% 1:4) %>%
# mutate(alpha=fct_rev(as.factor(alpha))) %>%
ggplot(aes(x=Y1, y=Y2))+
geom_point(size = 0.2)+
geom_smooth(formula = 'y~x', method = "lm")+
stat_cor(method = "pearson")+
facet_grid(rows = vars(id), cols = vars(alpha))+
labs(title = "Pearson correlation of four subjects")
lm_err2_h1 <- data.frame(alpha = alpha_vec)
# N_vec <- c(100, 200, 500, 1000)
for(n in seq_along(N_vec)){
err_n <- df_expwt_h1 %>%
filter(id <= N_vec[n]) %>%
group_by(id, alpha) %>%
group_modify(~{
fit_lm <- lm(Y2~Y1, data = .x)
data.frame(beta = coef(fit_lm)["Y1"],
pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
}) %>%
mutate(reject = pval < 0.05) %>%
group_by(alpha) %>%
summarise_at("reject", mean)
lm_err2_h1[, paste0("N", N_vec[n])] <- err_n$reject
}
lm_err2_h1 %>%
# rename("Alpha" = "ksize") %>%
# arrange(desc(alpha)) %>%
kable(table.attr = "style=\"color:black;\"") %>%
kable_styling(full_width = F) %>%
add_header_above(c(" " = 1, "Power" = 4))
| alpha | N100 | N200 | N500 | N1000 |
|---|---|---|---|---|
| 0.0 | 1 | 1 | 1 | 1 |
| 0.5 | 1 | 1 | 1 | 1 |
| 1.0 | 1 | 1 | 1 | 1 |
| 2.5 | 1 | 1 | 1 | 1 |
| 5.0 | 1 | 1 | 1 | 1 |
We will follow the same set up as Section 2.2.
load(here("Data/block_boot_H1_expwt.RData"))
# calculate coverage rate
lqt_mat2_h1 <- apply(slope_est_expwt_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat2_h1 <- apply(slope_est_expwt_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
cover_mat2_h1 <- lqt_mat2_h1 < 1 & 1 < uqt_mat2_h1
cover_rate2_h1 <- apply(cover_mat2_h1, 1:2, mean)
cover_rate2_h1 <- data.frame(alpha = alpha_vec, cover_rate2_h1)
cover_rate2_h1 %>% pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
# mutate(alpha=as.factor(alpha)) %>%
ggplot()+
geom_line(aes(x=bsize, y=value, group=as.factor(alpha), col=as.factor(alpha)))+
geom_point(aes(x=bsize, y=value, group=as.factor(alpha), col=as.factor(alpha)))+
geom_hline(yintercept = 0.95, col = "black", linetype = "dashed")+
labs(x="Block size", y="Coverage rate", col = "Alpha")+
scale_x_continuous(breaks = 1:max_size)
Below I show the mean and standard deviation of bootstrap slope estimates for all 100 subjects (each point represents a subject).
# mean slope estimates
slope_mean2_h1 <- apply(slope_est_expwt_h1, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean2_h1, perm = c(1, 3, 2)),
dim =c(N*length(alpha_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(alpha_vec)),
alpha = rep(alpha_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(alpha=as.factor(alpha)) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~alpha, ncol = 1)+
labs(x="Block size", title="Bootstrap mean of slope estimates")+
scale_x_continuous(breaks = 1:max_size)+
geom_hline(yintercept = 1, col="red", linetype = "dashed")
New discovery: the first row seems slightly biased! Does strong spatial correlation also affect the mean of the bootstrap estimates?
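As a quick sanity check of that question, here is a toy sketch, not the simulation code itself: `ma_field` is a hypothetical stand-in for the moving-average generator (flat weights, 9×9 filter), and I compare the OLS slope on one strongly correlated image pair with the mean of its block-bootstrap slope estimates.

```r
set.seed(1)
nS <- 32; k <- 9; b <- 4  # image size, filter size, bootstrap block size
ma_field <- function(nS, k) {
  # moving-average smooth of white noise (flat weights), valid region only
  z <- matrix(rnorm((nS + k - 1)^2), nS + k - 1)
  out <- matrix(0, nS, nS)
  for (i in 1:nS) for (j in 1:nS)
    out[i, j] <- mean(z[i:(i + k - 1), j:(j + k - 1)])
  out
}
Y1 <- ma_field(nS, k)
Y2 <- Y1 + ma_field(nS, k)  # true slope = 1
# block ids: b-by-b tiles of the image
blk <- ((row(Y1) - 1) %/% b) * (nS / b) + (col(Y1) - 1) %/% b + 1
df <- data.frame(Y1 = as.vector(Y1), Y2 = as.vector(Y2), blk = as.vector(blk))
ols <- unname(coef(lm(Y2 ~ Y1, data = df))["Y1"])
boot_slopes <- replicate(200, {
  ids <- sample(unique(df$blk), replace = TRUE)  # resample blocks
  bs <- do.call(rbind, lapply(ids, function(g) df[df$blk == g, ]))
  unname(coef(lm(Y2 ~ Y1, data = bs))["Y1"])
})
c(ols = ols, boot_mean = mean(boot_slopes))
```

If the bootstrap mean simply tracks the OLS estimate of that particular image, any finite-sample bias of the original fit under strong smoothing would carry over to the bootstrap distribution; this sketch only probes one replicate, so it is suggestive rather than conclusive.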
# standard deviation of slope estimates
slope_sd2_h1 <- apply(slope_est_expwt_h1, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd2_h1, perm = c(1, 3, 2)),
dim =c(N*length(alpha_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(alpha_vec)),
alpha = rep(alpha_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(alpha=as.factor(alpha)) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~alpha, ncol = 1)+
labs(x="Block size", title="Standard deviation of bootstrap estimates")+
scale_x_continuous(breaks = 1:max_size)
Below I also show the bootstrap slope estimates of one example subject.
array(slope_est_expwt_h1[,,15,], dim = c(length(alpha_vec)*max_size, M)) %>%
data.frame() %>%
mutate(alpha = rep(alpha_vec, max_size),
bsize = rep(1:max_size, each = length(alpha_vec)),
.before = 1) %>%
pivot_longer(starts_with("X")) %>%
mutate(alpha=as.factor(alpha)) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group=bsize))+
geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
facet_wrap(~alpha, ncol=1)+
labs(x="Block size", title = "Bootstrap slope estimates of subject 15")+
geom_hline(yintercept = 1, col="red", linetype = "dashed")
df_wt_type_h1 %>%
filter(id==15) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
pivot_longer(starts_with("Y")) %>%
ggplot()+
geom_tile(aes(x=s1, y=s2, fill = value))+
facet_grid(cols=vars(wt_type), rows = vars(name))+
labs(title = "Generated data of subject ID = 15")
df_wt_type_h1 %>%
# filter(id==1) %>%
filter(id %in% 1:4) %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot(aes(x=Y1, y=Y2))+
geom_point(size = 0.2)+
geom_smooth(formula = 'y~x', method = "lm")+
stat_cor(method = "pearson")+
facet_grid(rows = vars(id), cols = vars(wt_type))+
labs(title = "Pearson correlation of four subjects")
lm_err3_h1 <- data.frame(wt_type = wt_type_vec)
# N_vec <- c(100, 200, 500, 1000)
for(n in seq_along(N_vec)){
err_n <- df_wt_type_h1 %>%
filter(id <= N_vec[n]) %>%
group_by(id, wt_type) %>%
group_modify(~{
fit_lm <- lm(Y2~Y1, data = .x)
data.frame(beta = coef(fit_lm)["Y1"],
pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
}) %>%
mutate(reject = pval < 0.05) %>%
group_by(wt_type) %>%
summarise_at("reject", mean)
lm_err3_h1[, paste0("N", N_vec[n])] <- err_n$reject
}
lm_err3_h1 %>%
mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
kable(table.attr = "style=\"color:black;\"") %>%
kable_styling(full_width = F) %>%
add_header_above(c(" " = 1, "Power" = 4))
| wt_type | N100 | N200 | N500 | N1000 |
|---|---|---|---|---|
| wt_inc | 1 | 1 | 1 | 1 |
| wt_dec | 1 | 1 | 1 | 1 |
| wt_u | 1 | 1 | 1 | 1 |
| wt_sin | 1 | 1 | 1 | 1 |
| wt_cos | 1 | 1 | 1 | 1 |
load(here("Data/block_boot_H1_wt_type.RData"))
# calculate coverage rate
lqt_mat3_h1 <- apply(slope_est_wttype_h1, MARGIN = 1:3, FUN = quantile, probs = 0.025)
uqt_mat3_h1 <- apply(slope_est_wttype_h1, MARGIN = 1:3, FUN = quantile, probs = 0.975)
cover_mat3_h1 <- lqt_mat3_h1 < 1 & 1 < uqt_mat3_h1
cover_rate3_h1 <- apply(cover_mat3_h1, 1:2, mean)
cover_rate3_h1 <- data.frame(wt_type = wt_type_vec, cover_rate3_h1)
cover_rate3_h1 %>% pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(wt_type=factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot()+
geom_line(aes(x=bsize, y=value, group=wt_type, col=wt_type))+
geom_point(aes(x=bsize, y=value, group=wt_type, col=wt_type))+
geom_hline(yintercept = 0.95, col = "black", linetype = "dashed")+
labs(x="Block size", y="Coverage rate", col = "Weight matrix")+
scale_x_continuous(breaks = 1:max_size)
# mean slope estimates
slope_mean3_h1 <- apply(slope_est_wttype_h1, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean3_h1, perm = c(1, 3, 2)),
dim =c(N*length(wt_type_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(wt_type_vec)),
wt_type = rep(wt_type_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(wt_type=factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~wt_type, ncol = 1)+
labs(x="Block size", title="Bootstrap mean of slope estimates")+
scale_x_continuous(breaks = 1:max_size)+
geom_hline(yintercept = 1, col="red", linetype = "dashed")
We also see some bias in the first row (increasing weight).
# standard deviation of slope estimates
slope_sd3_h1 <- apply(slope_est_wttype_h1, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd3_h1, perm = c(1, 3, 2)),
dim =c(N*length(wt_type_vec), max_size)) %>%
data.frame() %>%
mutate(id = rep(1:N, each = length(wt_type_vec)),
wt_type = rep(wt_type_vec, N),
.before=1) %>%
pivot_longer(starts_with("X")) %>%
mutate(bsize = as.numeric(gsub("X", "", name))) %>%
mutate(wt_type=factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group = bsize))+
geom_jitter(aes(x=bsize, y=value), size = 0.2)+
facet_wrap(~wt_type, ncol = 1)+
labs(x="Block size", title="Standard deviation of bootstrap estimates")+
scale_x_continuous(breaks = 1:max_size)
Below I also show the bootstrap slope estimates of one example subject.
array(slope_est_wttype_h1[,,15,], dim = c(length(wt_type_vec)*max_size, M)) %>%
data.frame() %>%
mutate(wt_type = rep(wt_type_vec, max_size),
bsize = rep(1:max_size, each = length(wt_type_vec)),
.before = 1) %>%
pivot_longer(starts_with("X")) %>%
mutate(wt_type=factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
ggplot()+
geom_boxplot(aes(x=bsize, y=value, group=bsize))+
geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
facet_wrap(~wt_type, ncol=1)+
labs(x="Block size", title = "Bootstrap slope estimates of subject 15")+
geom_hline(yintercept = 1, col="red", linetype = "dashed")
Why couldn’t the block bootstrap approach reach the nominal type I error/coverage rate? One speculation is that, by generating the two outcomes separately from the same smoothing mechanism to introduce spatial intercorrelation, we also introduced correlation between the two variables.
To explore that, I want to look into the residuals after subtracting the effect of \(Y_1\) on \(Y_2\), in both the original image and the bootstrap samples.
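A toy sketch of why the residuals matter (again using a hypothetical flat-weight moving-average generator `ma_field`, not the simulation code): even after regressing \(Y_2\) on \(Y_1\), the residual image keeps strong spatial autocorrelation inherited from the smoothed noise field.

```r
set.seed(2)
nS <- 32; k <- 9  # image size, filter size
ma_field <- function(nS, k) {
  # moving-average smooth of white noise (flat weights), valid region only
  z <- matrix(rnorm((nS + k - 1)^2), nS + k - 1)
  out <- matrix(0, nS, nS)
  for (i in 1:nS) for (j in 1:nS)
    out[i, j] <- mean(z[i:(i + k - 1), j:(j + k - 1)])
  out
}
Y1 <- ma_field(nS, k)
Y2 <- Y1 + ma_field(nS, k)
# residual image after removing the effect of Y1 on Y2
res <- matrix(resid(lm(as.vector(Y2) ~ as.vector(Y1))), nS, nS)
# lag-1 spatial autocorrelation of the residuals (adjacent rows)
ac1 <- cor(as.vector(res[-1, ]), as.vector(res[-nS, ]))
ac1
```

With a 9×9 flat filter, adjacent pixels of the smoothed noise share 72 of 81 kernel cells, so a lag-1 autocorrelation near 0.9 is expected; the residuals should stay close to that, which is exactly the structure the semivariograms below try to display.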
Let’s start with one subject (subject 15). I will run the block bootstrap with different block sizes and visualize the residual inter-correlation. This turned out to be quite challenging: to plot the semivariogram, every pixel needs a location index, and it is not straightforward to assign one when putting the sampled blocks back into the original 2D format.
My approach is to create a 2D matrix that is guaranteed to fit all the sampled blocks, and to place the blocks one by one into their slots, leaving some empty pixels here and there.
# subset subject 15
df_id15 <- df_ksize %>% filter(id == 15)
# # simple linear regression
# df_id15 %>%
# group_by(ksize) %>%
# group_modify(~{this_fit <- lm(Y2~Y1, data = .x)
# data.frame(res = this_fit$residuals,
# pred = this_fit$fitted.values)}) %>%
# ggplot(aes(x=pred, y=res))+
# geom_point(size = 0.2)+
# geom_smooth(formula = y~x)+
# facet_wrap(~ksize)
#
# try_fit <- lm(Y2~Y1, data = df_id15)
# try_fit$fitted.values
#
# lapply(lm_list, function(x){x$residuals}) %>%
# bind_rows()
Out of curiosity, I would like to look at the distribution of residuals in the presence of spatial correlation.
# containers
res_list <- list()
for(k in seq_along(ksize_vec)){ # filter size for data generation
res_list[[k]] <- list()
# A matrix of observed outcome
this_df <- df_id15 %>% filter(ksize==ksize_vec[k])
this_df[, paste0("res_k", 1:max_size)] <- NA
Y_mat <- matrix(this_df$Y2, nS, nS)
for(b in 1:max_size){ # block size for bootstrap
# divide the matrix into blocks
rblock <- (row(Y_mat)-1)%/%b+1
cblock <- (col(Y_mat)-1)%/%b+1
block_id_mat <- (rblock-1)*max(cblock) + cblock
nblock <- max(block_id_mat)
# sample blocks
# sample the same number of blocks as the original image
this_df$block_id <- as.vector(block_id_mat)
block_list <- split(this_df, f = this_df$block_id)
# for(m in 1:M){
boot_block <- sample(1:nblock, size = nblock, replace = T)
boot_list <- block_list[boot_block]
names(boot_list) <- as.character(1:nblock)
boot_df <- bind_rows(boot_list, .id="new_block_id")
# fit model
boot_lm <- lm(Y2~Y1, data = boot_df)
# slope_est5[k, b] <- boot_lm$coefficients["Y1"]
boot_df$res <- boot_lm$residuals
new_boot_list <- split(boot_df, f=boot_df$new_block_id)
# put residuals back to its original 2D format
res_mat <- matrix(NA, nrow = sqrt(nblock)*b, ncol = sqrt(nblock)*b)
rblock2 <- (row(res_mat)-1)%/%b+1
cblock2 <- (col(res_mat)-1)%/%b+1
block_id_mat2 <- (rblock2-1)*max(cblock2) + cblock2
res_df <- data.frame(block_id_mat2) %>% mutate(s1 = 1:(sqrt(nblock)*b), .before=1) %>%
pivot_longer(starts_with("X"), names_to = "s2", values_to = "block_id") %>%
mutate(s2 = gsub("X", "", s2)) %>%
arrange(block_id, s2) %>%
mutate(res=NA)
for(l in 1:nblock){
res_df$res[res_df$block_id==l][1:nrow(new_boot_list[[l]])] <- new_boot_list[[l]]$res
}
res_list[[k]][[b]] <- res_df
}
}
# put the residuals back in the 2D image
for(k in seq_along(res_list)){
res_plt <- res_list[[k]] %>%
lapply(function(x){variogram(res~1, locations = ~s1+s2, data = x %>% filter(!is.na(res))
%>% mutate_all(as.numeric))}) %>%
bind_rows(.id="block_size") %>%
ggplot(aes(x=dist, y=gamma, col=as.factor(block_size)))+
geom_line()+
geom_point()+
labs(title = paste0("Filter size = ", ksize_vec[k]),
y = "Variogram", col = "Block size")
print(res_plt)
}
From the variograms of the residuals, for individual subjects, strong spatial correlation remained after the block bootstrap, and the block size did not make much difference. I am still thinking about how to show this collectively across multiple bootstrap iterations and multiple subjects.
Once we run the full bootstrap, for one subject at one filter size and one block size we will have 1000 semivariogram curves (that is, 1000 copies of each colored curve above). Can we simply plot them and smooth them?
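One option, sketched below with a hypothetical helper `summarise_variograms` (assuming each bootstrap replicate yields a data frame with `dist` and `gamma` columns, as `gstat::variogram` returns): pool the curves, bin by distance, and plot the mean with pointwise quantile bands instead of 1000 raw curves.

```r
# Pool many empirical variograms and summarize gamma per distance bin.
summarise_variograms <- function(vg_list, bin_width = 1) {
  pooled <- do.call(rbind, vg_list)
  # snap distances to bins of width `bin_width`
  pooled$dist_bin <- bin_width * round(pooled$dist / bin_width)
  out <- do.call(rbind, lapply(split(pooled, pooled$dist_bin), function(d) {
    data.frame(dist       = d$dist_bin[1],
               gamma_mean = mean(d$gamma),
               gamma_lo   = unname(quantile(d$gamma, 0.025)),
               gamma_hi   = unname(quantile(d$gamma, 0.975)))
  }))
  rownames(out) <- NULL
  out
}
# toy usage: three fake "bootstrap" variograms with a common shape
vg_list <- lapply(1:3, function(m)
  data.frame(dist = 1:10, gamma = 1 - exp(-(1:10) / 3) + rnorm(10, sd = 0.01)))
head(summarise_variograms(vg_list))
```

The summary could then go into `ggplot` as a mean line plus a `geom_ribbon` band per filter size and block size, which scales to 1000 replicates much better than overplotting the raw curves.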